CROP GROWTH RECOMMENDATIONS: OPTIMAL CONDITIONS FOR HIGHER YIELDS

1. INTRODUCTION

Precision agriculture is increasingly popular nowadays. It assists farmers in making informed decisions about farming strategies. The chosen dataset will help to create a prediction model that indicates the best crops to cultivate on a given farm depending on several criteria.

From this data, I am curious to learn:

Which crops flourish in high temperatures and which in low temperatures?

Which types of crops do the best in which types of soil?

How much rainfall is required for various crops?

2. Data

Dataset name: Crop recommendation

Dataset link: https://www.kaggle.com/datasets/aksahaha/crop-recommendation

Author name: ABHISHEK KUMAR

Data collection

The data was collected from ICAR (Indian Council of Agricultural Research), supplemented by some online searching on Google.

COLLECTION METHODOLOGY: This information was gathered by speaking with farmers and other agricultural professionals about their experiences cultivating crops under various environmental conditions.

Cases

The cases are observational. This dataset includes data on the amounts of nitrogen, phosphorus, and potassium in the soil, as well as temperature, humidity, pH, and rainfall, and how these factors affect crop development.

Variables

Nitrogen - Nitrogen content in soil

Phosphorus - Phosphorous content in soil

Potassium - Potassium content in soil

Temperature - Temperature in degree Celsius

Humidity - Relative humidity in %

ph - pH value of the soil

rainfall - Rainfall in mm

Label - Different types of crops

Type of study

It is an observational study.

# Loading the tidyverse and importing the data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 and dplyr are already attached by the tidyverse; readxl is not needed for a CSV
proposal_data <- read.csv("~/Desktop/CHECK/Crop_recommendation.csv")

Data looks as expected.

dim(proposal_data)
## [1] 2200   10

The original data contains 2200 observations and 10 variables.

head(proposal_data)
##   Nitrogen phosphorus potassium temperature humidity       ph rainfall label  X
## 1       90         42        43    20.87974 82.00274 6.502985 202.9355  rice NA
## 2       85         58        41    21.77046 80.31964 7.038096 226.6555  rice NA
## 3       60         55        44    23.00446 82.32076 7.840207 263.9642  rice NA
## 4       74         35        40    26.49110 80.15836 6.980401 242.8640  rice NA
## 5       78         42        42    20.13017 81.60487 7.628473 262.7173  rice NA
## 6       69         37        42    23.05805 83.37012 7.073454 251.0550  rice NA
##   X.1
## 1  NA
## 2  NA
## 3  NA
## 4  NA
## 5  NA
## 6  NA

Checking the first 6 rows of the data to get an idea of which variables are present.

3. Data Quality

is.data.frame(proposal_data)
## [1] TRUE
proposal_data2 <- as_tibble(proposal_data)
is_tibble(proposal_data2)
## [1] TRUE

The dataset above is a data frame, so we convert it to a tibble.

str(proposal_data2)
## tibble [2,200 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Nitrogen   : int [1:2200] 90 85 60 74 78 69 69 94 89 68 ...
##  $ phosphorus : int [1:2200] 42 58 55 35 42 37 55 53 54 58 ...
##  $ potassium  : int [1:2200] 43 41 44 40 42 42 38 40 38 38 ...
##  $ temperature: num [1:2200] 20.9 21.8 23 26.5 20.1 ...
##  $ humidity   : num [1:2200] 82 80.3 82.3 80.2 81.6 ...
##  $ ph         : num [1:2200] 6.5 7.04 7.84 6.98 7.63 ...
##  $ rainfall   : num [1:2200] 203 227 264 243 263 ...
##  $ label      : chr [1:2200] "rice" "rice" "rice" "rice" ...
##  $ X          : logi [1:2200] NA NA NA NA NA NA ...
##  $ X.1        : logi [1:2200] NA NA NA NA NA NA ...

Checking the detailed structure of the data.

sum(is.na(proposal_data2))
## [1] 4400

There are missing values in the data: 4400 NA entries in total.

colSums(is.na(proposal_data2))
##    Nitrogen  phosphorus   potassium temperature    humidity          ph 
##           0           0           0           0           0           0 
##    rainfall       label           X         X.1 
##           0           0        2200        2200

From the above we can see that two columns (X and X.1) contain no information.

clean_data <- proposal_data2[, -c(9, 10)]  # drop the empty X and X.1 columns
sum(is.na(clean_data))
## [1] 0
dim(clean_data)
## [1] 2200    8

In the above step, I removed the extra columns with no data, keeping only the columns that contain data. After removing the extra columns, we have only 8 variables.

sum(duplicated(clean_data))
## [1] 0

No duplicate rows are present in the data.

summary(clean_data)
##     Nitrogen        phosphorus       potassium       temperature    
##  Min.   :  0.00   Min.   :  5.00   Min.   :  5.00   Min.   : 8.826  
##  1st Qu.: 21.00   1st Qu.: 28.00   1st Qu.: 20.00   1st Qu.:22.769  
##  Median : 37.00   Median : 51.00   Median : 32.00   Median :25.599  
##  Mean   : 50.55   Mean   : 53.36   Mean   : 48.15   Mean   :25.616  
##  3rd Qu.: 84.25   3rd Qu.: 68.00   3rd Qu.: 49.00   3rd Qu.:28.562  
##  Max.   :140.00   Max.   :145.00   Max.   :205.00   Max.   :43.675  
##     humidity           ph           rainfall         label          
##  Min.   :14.26   Min.   :3.505   Min.   : 20.21   Length:2200       
##  1st Qu.:60.26   1st Qu.:5.972   1st Qu.: 64.55   Class :character  
##  Median :80.47   Median :6.425   Median : 94.87   Mode  :character  
##  Mean   :71.48   Mean   :6.469   Mean   :103.46                     
##  3rd Qu.:89.95   3rd Qu.:6.924   3rd Qu.:124.27                     
##  Max.   :99.98   Max.   :9.935   Max.   :298.56

Summarizing the data to check the mean, median, and quartiles of each variable.

clean_data %>% glimpse()
## Rows: 2,200
## Columns: 8
## $ Nitrogen    <int> 90, 85, 60, 74, 78, 69, 69, 94, 89, 68, 91, 90, 78, 93, 94…
## $ phosphorus  <int> 42, 58, 55, 35, 42, 37, 55, 53, 54, 58, 53, 46, 58, 56, 50…
## $ potassium   <int> 43, 41, 44, 40, 42, 42, 38, 40, 38, 38, 40, 42, 44, 36, 37…
## $ temperature <dbl> 20.87974, 21.77046, 23.00446, 26.49110, 20.13017, 23.05805…
## $ humidity    <dbl> 82.00274, 80.31964, 82.32076, 80.15836, 81.60487, 83.37012…
## $ ph          <dbl> 6.502985, 7.038096, 7.840207, 6.980401, 7.628473, 7.073454…
## $ rainfall    <dbl> 202.9355, 226.6555, 263.9642, 242.8640, 262.7173, 251.0550…
## $ label       <chr> "rice", "rice", "rice", "rice", "rice", "rice", "rice", "r…

The data has 7 numerical variables and one categorical variable.

clean_data %>% group_by(label) %>% summarise(count=n())
## # A tibble: 22 × 2
##    label       count
##    <chr>       <int>
##  1 apple         100
##  2 banana        100
##  3 blackgram     100
##  4 chickpea      100
##  5 coconut       100
##  6 coffee        100
##  7 cotton        100
##  8 grapes        100
##  9 jute          100
## 10 kidneybeans   100
## # ℹ 12 more rows

4. EDA (EXPLORATORY DATA ANALYSIS)

print(paste('Standard Deviation of Rainfall: ',sd(clean_data$rainfall)))
## [1] "Standard Deviation of Rainfall:  54.9583885248781"
print(paste('Standard Deviation of Temperature: ',sd(clean_data$temperature)))
## [1] "Standard Deviation of Temperature:  5.06374859995884"
print(paste('Standard Deviation of Humidity: ',sd(clean_data$humidity)))
## [1] "Standard Deviation of Humidity:  22.2638115897611"
print(paste('Standard Deviation of ph: ',sd(clean_data$ph)))
## [1] "Standard Deviation of ph:  0.773937688029873"
print(paste('Standard Deviation of Nitrogen: ',sd(clean_data$Nitrogen)))
## [1] "Standard Deviation of Nitrogen:  36.9173338337566"
print(paste('Standard Deviation of Phosphorous: ',sd(clean_data$phosphorus)))
## [1] "Standard Deviation of Phosphorous:  32.9858827385872"
print(paste('Standard Deviation of Potassium: ',sd(clean_data$potassium)))
## [1] "Standard Deviation of Potassium:  50.6479305466601"
print(paste('Variance of Rainfall: ',var(clean_data$rainfall)))
## [1] "Variance of Rainfall:  3020.42446925146"
print(paste('Variance of Temperature: ',var(clean_data$temperature)))
## [1] "Variance of Temperature:  25.6415498835851"
print(paste('Variance of Humidity: ',var(clean_data$humidity)))
## [1] "Variance of Humidity:  495.67730650438"
print(paste('Variance of ph: ',var(clean_data$ph)))
## [1] "Variance of ph:  0.598979544953026"
print(paste('Variance of Nitrogen: ',var(clean_data$Nitrogen)))
## [1] "Variance of Nitrogen:  1362.88953739303"
print(paste('Variance of Phosphorous: ',var(clean_data$phosphorus)))
## [1] "Variance of Phosphorous:  1088.06846004382"
print(paste('Variance of Potassium: ',var(clean_data$potassium)))
## [1] "Variance of Potassium:  2565.21286865931"
ggplot(data = clean_data) +
  geom_bar(mapping = aes(x = label,colour="label"))+coord_flip()

Result - From the above visualization, we can see the number of samples collected for each crop.

clean_data %>%  group_by(label) %>% summarise(avg = mean(rainfall)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Rainfall level", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can observe which crops need a high level of rainfall and which do not. For example, rice and coconut need high rainfall, whereas muskmelon needs very little rainfall to grow. The question from the proposal is answered.
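The ranking behind this observation can also be computed directly; a minimal sketch using the same grouped means (it reuses `clean_data` from above):

```r
# Crops with the highest and lowest average rainfall
clean_data %>%
  group_by(label) %>%
  summarise(avg = mean(rainfall)) %>%
  arrange(desc(avg)) %>%
  slice(c(1, n()))   # first row: rice (~236 mm); last row: muskmelon (~25 mm)
```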

clean_data %>%  group_by(label) %>% summarise(avg = mean(Nitrogen)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Nitrogen content in land", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can observe which crops need a high amount of nitrogen in the soil and which do not. For example, cotton and coffee need high nitrogen, whereas lentil needs very little. So we can say crops like cotton, coffee, muskmelon, banana, watermelon, and rice grow well in soil with a high amount of nitrogen.

clean_data %>%  group_by(label) %>% summarise(avg = mean(potassium)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Potassium content in land", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can observe that most crops do not need much potassium in the soil to grow. However, grapes and apple need a very high amount of potassium.

clean_data %>%  group_by(label) %>% summarise(avg = mean(phosphorus)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "phosphorous content in land", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can see that most crops need a medium amount of phosphorus in the soil, whereas apple and grapes need a high amount, and orange, coconut, watermelon, muskmelon, and pomegranate need very little.

clean_data %>%  group_by(label) %>% summarise(avg = mean(temperature)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "Temperature", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can see that a few crops such as papaya, mango, blackgram, muskmelon, and mungbean grow in high temperatures. No crop above has a mean temperature under 18 degrees, so we can say the crops in this data grow only in medium or high temperatures. The question from the proposal is answered through the above visual representation.

clean_data %>%  group_by(label) %>% summarise(avg = mean(humidity)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "HUMIDITY", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can see that most crops need a high amount of humidity (moisture in the air), whereas chickpea, kidney beans, pigeon peas, and mango need very little moisture.

clean_data %>%  group_by(label) %>% summarise(avg = mean(ph)) %>% ggplot(mapping = aes(x = reorder(label,avg), y=avg,fill=label,space=1))+
geom_bar(stat="identity",width=0.7, position = position_dodge(width=0.5))+labs(title = "ph of crop field", x = "CROPS", y = "AVERAGE") +coord_flip()

Result - From the above visualization, we can see that most of the crops in the data need a similar soil pH level.

Step - Calculating the mean of each variable per crop to see exactly what climatic and land conditions should be present for a crop to grow.

SETA <- clean_data %>% group_by(label) %>% 
  summarise(mean_N = mean(Nitrogen), mean_ph = mean(phosphorus),  # mean_ph is the mean phosphorus content, not soil pH
            mean_T = mean(temperature), mean_K = mean(potassium),
            mean_R = mean(rainfall), mean_H = mean(humidity),
            .groups = 'drop')
print(SETA)
## # A tibble: 22 × 7
##    label       mean_N mean_ph mean_T mean_K mean_R mean_H
##    <chr>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 apple         20.8   134.    22.6  200.   113.    92.3
##  2 banana       100.     82.0   27.4   50.0  105.    80.4
##  3 blackgram     40.0    67.5   30.0   19.2   67.9   65.1
##  4 chickpea      40.1    67.8   18.9   79.9   80.1   16.9
##  5 coconut       22.0    16.9   27.4   30.6  176.    94.8
##  6 coffee       101.     28.7   25.5   29.9  158.    58.9
##  7 cotton       118.     46.2   24.0   19.6   80.4   79.8
##  8 grapes        23.2   133.    23.8  200.    69.6   81.9
##  9 jute          78.4    46.9   25.0   40.0  175.    79.6
## 10 kidneybeans   20.8    67.5   20.1   20.0  106.    21.6
## # ℹ 12 more rows

Result - The requirement mentioned in the above step is satisfied: we can clearly see what level of temperature, rainfall, and humidity, and what amount of nitrogen, potassium, and phosphorus, each crop needs to grow. Note that the mean_ph column holds the mean phosphorus content, not soil pH: apple's value of about 134 could not be a pH. The mean of soil pH is not included in this table.

Step - Converting the wide data set into a long data set to get a proper visualization of the data.

pivoted_data <- SETA %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_data)
## # A tibble: 132 × 3
##    label  land_condition Amount
##    <chr>  <chr>           <dbl>
##  1 apple  mean_N           20.8
##  2 apple  mean_ph         134. 
##  3 apple  mean_T           22.6
##  4 apple  mean_K          200. 
##  5 apple  mean_R          113. 
##  6 apple  mean_H           92.3
##  7 banana mean_N          100. 
##  8 banana mean_ph          82.0
##  9 banana mean_T           27.4
## 10 banana mean_K           50.0
## # ℹ 122 more rows

Result - The requirement mentioned in the above step is satisfied: the different soil and climatic conditions are listed in the land_condition column, and the amount or level required is given in the Amount column.

sum(is.na(pivoted_data))
## [1] 0

Result - After the conversion, there are still no NA values present.

pivoted_data %>% ggplot() + geom_bar(mapping = aes(x = label, y = Amount, color = land_condition), stat = "identity") + coord_flip()

Result - In the above visual representation, we can clearly see the land conditions of each crop: all the conditions a crop needs to grow healthily are shown, with each condition differentiated by colour. Each bar segment represents one condition.

ggplot(data = pivoted_data) +geom_point(mapping = aes(x=Amount,y=label,colour =land_condition),alpha = 0.7,show.legend = TRUE)

Result - The same representation is done using points, with the conditions differentiated by colour, to get a clearer picture of the amount required for each condition.

ggplot(data = SETA) +geom_point(mapping = aes(x=label,y=mean_H,colour =mean_N ))+coord_flip()

Result - From the above visual representation, we can say there is no clear relation between nitrogen and humidity; in some cases where the nitrogen level in the land is high, humidity is also high.

ggplot(data = SETA) +geom_point(mapping = aes(x=mean_R,y=label,colour =mean_T))

Result - The above graph is plotted to understand the relation between rainfall and temperature. Whether the temperature is high or low, each crop needs a certain amount of rainfall, so we can say there is no clear relation between temperature and rainfall.

ggplot(data=pivoted_data)+geom_line(mapping=aes(x=land_condition,y=Amount,color=label),group =1)

Result - In the above graph, the conditions for the different crops are drawn as lines, but the representation is not clear because of the large number of crops.

Step - As the above representation is not clear, we divide the data into subsets.

# Dividing the data into subsets to make the plots clearer
subsetF<- filter(SETA, label %in% c("apple","banana","blackgram","chickpea","coconut","coffee","cotton","grapes","jute","kidneybeans","lentil"))
subsetF
## # A tibble: 11 × 7
##    label       mean_N mean_ph mean_T mean_K mean_R mean_H
##    <chr>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 apple         20.8   134.    22.6  200.   113.    92.3
##  2 banana       100.     82.0   27.4   50.0  105.    80.4
##  3 blackgram     40.0    67.5   30.0   19.2   67.9   65.1
##  4 chickpea      40.1    67.8   18.9   79.9   80.1   16.9
##  5 coconut       22.0    16.9   27.4   30.6  176.    94.8
##  6 coffee       101.     28.7   25.5   29.9  158.    58.9
##  7 cotton       118.     46.2   24.0   19.6   80.4   79.8
##  8 grapes        23.2   133.    23.8  200.    69.6   81.9
##  9 jute          78.4    46.9   25.0   40.0  175.    79.6
## 10 kidneybeans   20.8    67.5   20.1   20.0  106.    21.6
## 11 lentil        18.8    68.4   24.5   19.4   45.7   64.8
# Dividing the data into subsets to make the plots clearer
subsetG<- filter(SETA, label %in% c("maize","mango","mothbeans","mungbean","muskmelon","orange","papaya","pigeonpeas","pomegranate","rice","watermelon"))
subsetG
## # A tibble: 11 × 7
##    label       mean_N mean_ph mean_T mean_K mean_R mean_H
##    <chr>        <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 maize         77.8    48.4   22.4   19.8   84.8   65.1
##  2 mango         20.1    27.2   31.2   29.9   94.7   50.2
##  3 mothbeans     21.4    48.0   28.2   20.2   51.2   53.2
##  4 mungbean      21.0    47.3   28.5   19.9   48.4   85.5
##  5 muskmelon    100.     17.7   28.7   50.1   24.7   92.3
##  6 orange        19.6    16.6   22.8   10.0  110.    92.2
##  7 papaya        49.9    59.0   33.7   50.0  143.    92.4
##  8 pigeonpeas    20.7    67.7   27.7   20.3  149.    48.1
##  9 pomegranate   18.9    18.8   21.8   40.2  108.    90.1
## 10 rice          79.9    47.6   23.7   39.9  236.    82.3
## 11 watermelon    99.4    17     25.6   50.2   50.8   85.2

Result - We divided the data into two subsets, F and G. Each subset contains 11 crops.

pivoted_dataA <- subsetF %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_dataA)
## # A tibble: 66 × 3
##    label  land_condition Amount
##    <chr>  <chr>           <dbl>
##  1 apple  mean_N           20.8
##  2 apple  mean_ph         134. 
##  3 apple  mean_T           22.6
##  4 apple  mean_K          200. 
##  5 apple  mean_R          113. 
##  6 apple  mean_H           92.3
##  7 banana mean_N          100. 
##  8 banana mean_ph          82.0
##  9 banana mean_T           27.4
## 10 banana mean_K           50.0
## # ℹ 56 more rows
pivoted_dataB <- subsetG %>% pivot_longer(mean_N:mean_H,names_to = "land_condition",values_to = "Amount")
print(pivoted_dataB)
## # A tibble: 66 × 3
##    label land_condition Amount
##    <chr> <chr>           <dbl>
##  1 maize mean_N           77.8
##  2 maize mean_ph          48.4
##  3 maize mean_T           22.4
##  4 maize mean_K           19.8
##  5 maize mean_R           84.8
##  6 maize mean_H           65.1
##  7 mango mean_N           20.1
##  8 mango mean_ph          27.2
##  9 mango mean_T           31.2
## 10 mango mean_K           29.9
## # ℹ 56 more rows

Result - Converted the wide data subsets into long data subsets.

ggplot(data=pivoted_dataA)+geom_line(mapping=aes(x=Amount, y=land_condition, group=label, color=land_condition))+ facet_wrap(~ label)

Result - From the above visual representation, we can clearly see that no two crops match in every condition. For example, lentil, grapes, and apple have similar temperatures but differ in the remaining conditions. The question from the proposal is answered.

ggplot(data=pivoted_dataB)+geom_line(mapping=aes(x=Amount, y=land_condition, group=label, color=land_condition))+ facet_wrap(~ label)

Result - From the above visual representation, we can clearly see that mothbeans and mungbean require almost the same climatic and land conditions, whereas the remaining crops have quite different requirements.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(subsetF)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Result - From the above visual representation, we can observe both positive and negative correlations between variables. For example, temperature and nitrogen appear positively correlated, whereas mean_ph (phosphorus) and nitrogen appear negatively correlated.

ggplot(data =  pivoted_dataA, mapping = aes(x = Amount, y = land_condition,group=label,colour=label)) + geom_boxplot() 

Result - This visualization checks which crops have similar conditions. All the conditions of jute and blackgram fall in a similar range with slight variation, as do those of grapes and apple; so for almost all crops, the conditions differ in one respect or another.

ggplot(data =  pivoted_dataB, mapping = aes(x = Amount, y = land_condition,group=label,colour=label)) + geom_boxplot() 

5. HYPOTHESIS TESTING

Step - Taking a sample of 2 crops and calculating the means and standard deviations to perform a two-tailed test on rainfall for apple and banana.

Null hypothesis: The mean of apple rainfall is equal to banana rainfall.

Alternative hypothesis: The mean of apple rainfall is not equal to banana rainfall.

Apple <- clean_data$rainfall[clean_data$label == "apple"]

Banana <- clean_data$rainfall[clean_data$label == "banana"]
print(Mean1 <- mean(Apple))
## [1] 112.6548
print(Mean2 <- mean(Banana))
## [1] 104.627
sd1 <- sd(Apple)
print(paste('Standard Deviation of Apple rainfall: ', sd1))
## [1] "Standard Deviation of Apple rainfall:  7.10298539071806"
round(sd1,digits = 2)
## [1] 7.1
sd2 <- sd(Banana)
print(paste('Standard Deviation of Banana rainfall: ', sd2))
## [1] "Standard Deviation of Banana rainfall:  9.39814957319825"
round(sd2,digits = 2)
## [1] 9.4
n<-2200
SE <- sqrt((sd1^2/n) + (sd2^2/n))
SE
## [1] 0.2511588

As it is a two-tailed two-sample t-test, the alpha value is divided in two (0.05/2 = 0.025) and the critical value is taken from the t-table accordingly. Note that n = 2200 is the size of the whole dataset; each crop actually has only 100 observations, so n = 100 per group would be more appropriate here (this would enlarge the standard error and shrink t, but not change the conclusion below).
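The tabled critical value can also be computed directly in R (a sketch; with the sample sizes used above, df = n + n - 2 = 4398):

```r
# Two-tailed 5% critical value of the t-distribution
qt(p = 0.975, df = 4398)   # ~1.96, the value used as the cutoff below
```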

X1 <- Mean1
X2 <- Mean2
t <- (X1-X2)/SE
alpha = 0.05 
zscore <- 1.96
# As it is a two-tailed test, if t < -1.96 or t > 1.96, we reject the null hypothesis.
t<-round(t,digits = 2)
t
## [1] 31.96
print(degreesoffreedom <- (n+n)-2)
## [1] 4398

Result - From the t statistic we can conclude that the t value is greater than 1.96, and the p-value is less than the alpha value (using a t-distribution table, p < 0.001), so we reject the null hypothesis.
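As a cross-check, the built-in Welch two-sample t-test could be run on the same vectors (a sketch; note that with the actual per-group size of 100 observations the t statistic comes out smaller, roughly 6.8, but the decision to reject is unchanged):

```r
# Welch two-sample t-test of apple vs. banana rainfall
t.test(Apple, Banana, alternative = "two.sided")
```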

Step - People from a particular place claimed that the pH ranges from 6.47 to 6.48; the mean pH from our data is 6.469.

Null hypothesis: The mean pH is equal to 6.48.

Alternative hypothesis: The mean pH is less than 6.48.

alpha = 0.05
zscore = 1.64
t.test(clean_data$ph, mu = 6.48, alternative = "less")
## 
##  One Sample t-test
## 
## data:  clean_data$ph
## t = -0.63756, df = 2199, p-value = 0.2619
## alternative hypothesis: true mean is less than 6.48
## 95 percent confidence interval:
##      -Inf 6.496632
## sample estimates:
## mean of x 
##   6.46948

Result - Here the sample mean (6.469) is below 6.48, but the p-value (0.2619) is greater than the alpha value, so we fail to reject the null hypothesis.
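The same decision can be read off programmatically by capturing the test object (a sketch reusing the call above):

```r
res <- t.test(clean_data$ph, mu = 6.48, alternative = "less")
res$p.value          # 0.2619, from the output above
res$p.value > 0.05   # TRUE, i.e. we fail to reject the null hypothesis at the 5% level
```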

6. Linear Regression

plot(x = SETA$mean_T,y = SETA$mean_R,
   xlab = "Temperature",
   ylab = "Rainfall",
   main = "Temperature vs Rainfall"
)

Result - No clear relation is seen between the variables.

ggplot(SETA, aes(x = mean_T, y = mean_R)) + geom_point(color = "red") + geom_smooth(method = "lm", color = "blue")
## `geom_smooth()` using formula = 'y ~ x'

B <- lm(SETA$mean_R~SETA$mean_T)
summary(B)
## 
## Call:
## lm(formula = SETA$mean_R ~ SETA$mean_T)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -76.66 -34.45  -3.12  35.38 131.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.2623    82.2271   1.475    0.156
## SETA$mean_T  -0.6948     3.1793  -0.219    0.829
## 
## Residual standard error: 53.18 on 20 degrees of freedom
## Multiple R-squared:  0.002382,   Adjusted R-squared:  -0.0475 
## F-statistic: 0.04776 on 1 and 20 DF,  p-value: 0.8292

Result - From the regression we get the equation yˆ = 121.2623 - 0.6948x, where β0 = 121.2623 and β1 = -0.6948.
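As an illustration, the fitted equation can be evaluated at an example temperature (a sketch using the coefficients from the summary above; 25 degrees Celsius is an arbitrary choice):

```r
# Predicted average rainfall at 25 degrees Celsius
121.2623 - 0.6948 * 25   # ~103.9 mm, close to the overall mean rainfall
```

With a slope this close to zero, the prediction barely moves across the observed temperature range, consistent with the weak fit.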

Summary_Data <- summary(B)  # capture model summary as an object
CoefficientsB <- Summary_Data$coefficients  # coefficient table of the model
beta.estimate <- CoefficientsB["SETA$mean_T", "Estimate"]  # slope estimate for mean_T
standard_error <- CoefficientsB["SETA$mean_T", "Std. Error"]  # its standard error
t_value <- beta.estimate/standard_error  # calculate the t statistic
t_value
## [1] -0.2185447
qt(p = .025, df = 20)
## [1] -2.085963

Result - We reject the null hypothesis at the 5% significance level when |t| >= |t_0.025,20|. Here |t| = 0.2185 < 2.086, so we fail to reject the null hypothesis and conclude that the fitted slope is not statistically significant at the 5% level. This agrees with the p-value of 0.8292 in the regression summary.

qt(p = .005, df = 20)
## [1] -2.84534

Result - Similarly, we reject at the 1% significance level only when |t| >= |t_0.005,20|. Here |t| = 0.2185 < 2.845, so we again fail to reject the null hypothesis; the slope is not significant at the 1% level either.
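For reference, the two-sided p-value can be computed directly from the t statistic (a sketch):

```r
# Two-sided p-value for the slope's t statistic with 20 degrees of freedom
2 * pt(abs(-0.2185447), df = 20, lower.tail = FALSE)   # ~0.829, matching summary(B)
```

Since this p-value is far above both 0.05 and 0.01, the slope is not significant at either level.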

cor(SETA$mean_T,SETA$mean_R)
## [1] -0.04880984
confint(B, level=.95)
##                  2.5 %    97.5 %
## (Intercept) -50.260415 292.78496
## SETA$mean_T  -7.326705   5.93707
cbind(B$residuals, B$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = B$residuals, x = B$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)

SETA_sqrt <- sqrt(SETA$mean_R)
D1 <- lm(SETA_sqrt~SETA$mean_T)
summary(D1)
## 
## Call:
## lm(formula = SETA_sqrt ~ SETA$mean_T)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7315 -1.5590  0.1505  2.0042  5.4013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 11.23613    4.01645   2.798   0.0111 *
## SETA$mean_T -0.05358    0.15530  -0.345   0.7337  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.598 on 20 degrees of freedom
## Multiple R-squared:  0.005916,   Adjusted R-squared:  -0.04379 
## F-statistic: 0.119 on 1 and 20 DF,  p-value: 0.7337

Result - From the above we get the equation yˆ = 11.23613 - 0.05358x, where yˆ is the square root of mean rainfall.
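Because the response was square-root transformed, predictions from this model must be squared to return to the rainfall scale (a sketch using the coefficients above; 25 degrees Celsius is an arbitrary example):

```r
# Back-transformed predicted rainfall at 25 degrees Celsius
(11.23613 - 0.05358 * 25)^2   # ~97.9 mm
```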

cbind(D1$residuals, D1$fitted.values) %>%
as.data.frame() %>%
ggplot(aes(y = D1$residuals, x = D1$fitted.values)) +
geom_point() + labs(y = "Residuals", x = "Fitted Values") +
theme(text = element_text(size = 16))+geom_hline(yintercept = 0)

There is not much difference in the spread of the residuals against the fitted values, though a slight increase in the residual spread can be observed.

7. PCA

new_data <- clean_data[1:7]
scaled_data = scale(new_data)
head(scaled_data)
##       Nitrogen  phosphorus   potassium temperature  humidity         ph
## [1,] 1.0685545 -0.34447243 -0.10166439  -0.9353743 0.4725590 0.04329189
## [2,] 0.9331167  0.14058356 -0.14115268  -0.7594734 0.3969610 0.73470553
## [3,] 0.2559281  0.04963556 -0.08192025  -0.5157809 0.4868431 1.77110780
## [4,] 0.6351537 -0.55668443 -0.16089682   0.1727678 0.3897169 0.66015759
## [5,] 0.7435039 -0.34447243 -0.12140853  -1.0834008 0.4546883 1.49752731
## [6,] 0.4997160 -0.49605243 -0.12140853  -0.5051979 0.5339759 0.78039027
##      rainfall
## [1,] 1.809949
## [2,] 2.241548
## [3,] 2.920402
## [4,] 2.536471
## [5,] 2.897714
## [6,] 2.685511
Data_A <-as.matrix(new_data)
CovarianceMatrix = cov(Data_A)
print("CovarianceMatrix")
## [1] "CovarianceMatrix"
CovarianceMatrix
##                Nitrogen  phosphorus  potassium  temperature   humidity
## Nitrogen    1362.889537 -281.860096 -262.72715   4.95462225 156.730700
## phosphorus  -281.860096 1088.068460 1229.99865 -21.30347754 -87.197323
## potassium   -262.727147 1229.998647 2565.21287 -41.13422930 215.215502
## temperature    4.954622  -21.303478  -41.13423  25.64154988  23.147400
## humidity     156.730700  -87.197323  215.21550  23.14740049 495.677307
## ph             2.762395   -3.523487   -6.64424  -0.06973913  -0.146161
## rainfall     119.747146 -115.730685 -148.81121  -8.37217973 115.534462
##                      ph    rainfall
## Nitrogen     2.76239482  119.747146
## phosphorus  -3.52348679 -115.730685
## potassium   -6.64424046 -148.811212
## temperature -0.06973913   -8.372180
## humidity    -0.14616095  115.534462
## ph           0.59897954   -4.639202
## rainfall    -4.63920157 3020.424469

Result - The covariance matrix is calculated to observe how one variable varies with another, that is, the relation between pairs of variables. Both positive and negative covariances are present in the data. Note that it is computed on the unscaled data (the scaled copy above is not used here), so high-variance variables such as rainfall and potassium dominate.

Step - The correlation is calculated below to check whether there is any trend between the variables.

CorrelationMatrix = cor(Data_A)
print("CorrelationMatrix")
## [1] "CorrelationMatrix"
CorrelationMatrix
##                Nitrogen  phosphorus   potassium temperature     humidity
## Nitrogen     1.00000000 -0.23145958 -0.14051184  0.02650380  0.190688379
## phosphorus  -0.23145958  1.00000000  0.73623222 -0.12754113 -0.118734116
## potassium   -0.14051184  0.73623222  1.00000000 -0.16038713  0.190858861
## temperature  0.02650380 -0.12754113 -0.16038713  1.00000000  0.205319677
## humidity     0.19068838 -0.11873412  0.19085886  0.20531968  1.000000000
## ph           0.09668285 -0.13801889 -0.16950310 -0.01779502 -0.008482539
## rainfall     0.05902022 -0.06383905 -0.05346135 -0.03008378  0.094423053
##                       ph    rainfall
## Nitrogen     0.096682846  0.05902022
## phosphorus  -0.138018893 -0.06383905
## potassium   -0.169503098 -0.05346135
## temperature -0.017795017 -0.03008378
## humidity    -0.008482539  0.09442305
## ph           1.000000000 -0.10906948
## rainfall    -0.109069484  1.00000000
transversecov <- t(CovarianceMatrix)  # transpose of the covariance matrix
transversecov
##                Nitrogen  phosphorus  potassium  temperature   humidity
## Nitrogen    1362.889537 -281.860096 -262.72715   4.95462225 156.730700
## phosphorus  -281.860096 1088.068460 1229.99865 -21.30347754 -87.197323
## potassium   -262.727147 1229.998647 2565.21287 -41.13422930 215.215502
## temperature    4.954622  -21.303478  -41.13423  25.64154988  23.147400
## humidity     156.730700  -87.197323  215.21550  23.14740049 495.677307
## ph             2.762395   -3.523487   -6.64424  -0.06973913  -0.146161
## rainfall     119.747146 -115.730685 -148.81121  -8.37217973 115.534462
##                      ph    rainfall
## Nitrogen     2.76239482  119.747146
## phosphorus  -3.52348679 -115.730685
## potassium   -6.64424046 -148.811212
## temperature -0.06973913   -8.372180
## humidity    -0.14616095  115.534462
## ph           0.59897954   -4.639202
## rainfall    -4.63920157 3020.424469
# If CovarianceMatrix were orthogonal, this product would be the identity matrix;
# since the matrix is symmetric, the product is simply CovarianceMatrix squared
Orthogonal_check = CovarianceMatrix%*%transversecov
Orthogonal_check
##                 Nitrogen  phosphorus   potassium  temperature   humidity
## Nitrogen     2044874.629 -1041621.49 -1363017.68   26316.4970 273278.179
## phosphorus  -1041621.492  2797697.98  4566938.91  -76766.6804  68576.767
## potassium   -1363017.675  4566938.91  8232437.95 -127849.7451 492177.013
## temperature    26316.497   -76766.68  -127849.75    3433.8007   4881.335
## humidity      273278.179    68576.77   492177.01    4881.3351 338065.626
## ph              5926.462   -12235.79   -21445.73     395.6818  -1299.891
## rainfall      614659.449  -702147.83  -979774.86  -13647.5637 402870.793
##                       ph   rainfall
## Nitrogen      5926.46170  614659.45
## phosphorus  -12235.79248 -702147.83
## potassium   -21445.73322 -979774.86
## temperature    395.68185  -13647.56
## humidity     -1299.89101  402870.79
## ph              86.09891  -12304.14
## rainfall    -12304.13759 9186281.55
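The product above is clearly not the identity matrix, so the covariance matrix is not orthogonal. For contrast, a minimal self-contained sketch with a matrix that *is* orthogonal (a 2-D rotation):

```r
# A genuinely orthogonal matrix (2-D rotation): Q %*% t(Q) is the identity
theta <- pi / 6
Q <- matrix(c(cos(theta), sin(theta),
              -sin(theta), cos(theta)), nrow = 2)
round(Q %*% t(Q), 10)   # 2x2 identity matrix
```

Because the covariance matrix is symmetric, `CovarianceMatrix %*% t(CovarianceMatrix)` equals the covariance matrix squared, which is why all its entries are so large.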
EigenValuescovariance = eigen(CovarianceMatrix)
EigenValuescovariance
## eigen() decomposition
## $values
## [1] 3434.457714 2933.051775 1349.006967  566.260449  252.160060   23.008944
## [7]    0.567262
## 
## $vectors
##              [,1]         [,2]         [,3]          [,4]         [,5]
## [1,]  0.180735334 -0.028231188  0.946431143  0.2656764987 -0.013517602
## [2,] -0.440880097  0.207408888 -0.057190143  0.5601808540  0.667049900
## [3,] -0.758764968  0.393266840  0.214289728 -0.2272182955 -0.413334754
## [4,]  0.010980213 -0.009163157  0.002162218 -0.0349757509  0.075558593
## [5,] -0.015315872  0.067715774  0.224545325 -0.7493652958  0.614766109
## [6,]  0.001466901 -0.002582226  0.001242598  0.0003892233  0.001401021
## [7,]  0.443709263  0.892664376 -0.068194347  0.0348346648 -0.019174321
##              [,6]          [,7]
## [1,]  0.006060655 -0.0015397910
## [2,] -0.024029262 -0.0001324106
## [3,]  0.034854411  0.0028681784
## [4,]  0.996377177  0.0095348819
## [5,] -0.072607445 -0.0013559969
## [6,] -0.009704231  0.9999466735
## [7,]  0.006127128  0.0018117827

Result - The eigendecomposition of the covariance matrix is computed; eigen() returns the result in two parts, the eigenvalues ($values) and the eigenvectors ($vectors)
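As a sanity check on the decomposition, a toy symmetric matrix can be rebuilt from its eigenvalues and eigenvectors via the spectral identity A = V diag(lambda) t(V); the same identity holds for CovarianceMatrix with Valuescov and Vectorscov below:

```r
# Toy symmetric matrix: spectral decomposition A = V diag(lambda) t(V)
A <- matrix(c(4, 1, 1, 3), nrow = 2)
e <- eigen(A)
reconstructed <- e$vectors %*% diag(e$values) %*% t(e$vectors)
all.equal(A, reconstructed)   # TRUE
```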

Valuescov <- eigen(CovarianceMatrix)$values
is.nan(Valuescov)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Valuescov
## [1] 3434.457714 2933.051775 1349.006967  566.260449  252.160060   23.008944
## [7]    0.567262
Vectorscov <-eigen(CovarianceMatrix)$vectors
Vectorscov 
##              [,1]         [,2]         [,3]          [,4]         [,5]
## [1,]  0.180735334 -0.028231188  0.946431143  0.2656764987 -0.013517602
## [2,] -0.440880097  0.207408888 -0.057190143  0.5601808540  0.667049900
## [3,] -0.758764968  0.393266840  0.214289728 -0.2272182955 -0.413334754
## [4,]  0.010980213 -0.009163157  0.002162218 -0.0349757509  0.075558593
## [5,] -0.015315872  0.067715774  0.224545325 -0.7493652958  0.614766109
## [6,]  0.001466901 -0.002582226  0.001242598  0.0003892233  0.001401021
## [7,]  0.443709263  0.892664376 -0.068194347  0.0348346648 -0.019174321
##              [,6]          [,7]
## [1,]  0.006060655 -0.0015397910
## [2,] -0.024029262 -0.0001324106
## [3,]  0.034854411  0.0028681784
## [4,]  0.996377177  0.0095348819
## [5,] -0.072607445 -0.0013559969
## [6,] -0.009704231  0.9999466735
## [7,]  0.006127128  0.0018117827
EigenValuescorrelation = eigen(CorrelationMatrix)
EigenValuescorrelation
## eigen() decomposition
## $values
## [1] 1.9312182 1.2939102 1.0765093 1.0228912 0.8059284 0.6765616 0.1929812
## 
## $vectors
##             [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
## [1,]  0.30219096 -0.33410693 -0.1120450  0.54165059  0.50778466  0.48290443
## [2,] -0.64378667 -0.03435809 -0.1099391  0.04629318 -0.08233115  0.37684700
## [3,] -0.62260719 -0.28382920 -0.1631733  0.15486709 -0.03342452  0.02896707
## [4,]  0.21242839 -0.35948683 -0.2482280 -0.69082649 -0.15486542  0.50041798
## [5,]  0.06848339 -0.73791663 -0.2135991  0.06717140 -0.12887133 -0.54787098
## [6,]  0.22694272  0.22065738 -0.5485203  0.39570047 -0.65188053  0.12571195
## [7,]  0.07253163 -0.29015800  0.7352670  0.20531846 -0.51838188  0.23992979
##              [,7]
## [1,]  0.008472888
## [2,]  0.649104376
## [3,] -0.692268474
## [4,] -0.111281619
## [5,]  0.289624027
## [6,] -0.040027859
## [7,] -0.038576857

Result - The eigendecomposition of the correlation matrix is computed in the same way; again eigen() splits the result into eigenvalues and eigenvectors

Valuescor <- eigen(CorrelationMatrix)$values
is.nan(Valuescor)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Valuescor
## [1] 1.9312182 1.2939102 1.0765093 1.0228912 0.8059284 0.6765616 0.1929812
Vectorscor <- eigen(CorrelationMatrix)$vectors
Vectorscor 
##             [,1]        [,2]       [,3]        [,4]        [,5]        [,6]
## [1,]  0.30219096 -0.33410693 -0.1120450  0.54165059  0.50778466  0.48290443
## [2,] -0.64378667 -0.03435809 -0.1099391  0.04629318 -0.08233115  0.37684700
## [3,] -0.62260719 -0.28382920 -0.1631733  0.15486709 -0.03342452  0.02896707
## [4,]  0.21242839 -0.35948683 -0.2482280 -0.69082649 -0.15486542  0.50041798
## [5,]  0.06848339 -0.73791663 -0.2135991  0.06717140 -0.12887133 -0.54787098
## [6,]  0.22694272  0.22065738 -0.5485203  0.39570047 -0.65188053  0.12571195
## [7,]  0.07253163 -0.29015800  0.7352670  0.20531846 -0.51838188  0.23992979
##              [,7]
## [1,]  0.008472888
## [2,]  0.649104376
## [3,] -0.692268474
## [4,] -0.111281619
## [5,]  0.289624027
## [6,] -0.040027859
## [7,] -0.038576857
Squareroot <- Vectorscov %*% diag(sqrt(Valuescov)) %*% t(Vectorscov)
Squareroot
##             [,1]        [,2]       [,3]        [,4]         [,5]         [,6]
## [1,] 36.53938440 -3.57722281 -2.5357817 -0.00291165  2.668100755  0.063397804
## [2,] -3.57722281 28.37690443 16.1607979 -0.17189899 -2.784237689 -0.048469178
## [3,] -2.53578166 16.16079785 47.7497771 -0.80661723  3.895165740 -0.121208300
## [4,] -0.00291165 -0.17189899 -0.8066172  4.89369222  0.988652472 -0.035518176
## [5,]  2.66810076 -2.78423769  3.8951657  0.98865247 21.503469040  0.008556504
## [6,]  0.06339780 -0.04846918 -0.1212083 -0.03551818  0.008556504  0.754118104
## [7,]  1.18887505 -1.03342478 -1.3162204 -0.18558547  1.502518400 -0.088829509
##             [,7]
## [1,]  1.18887505
## [2,] -1.03342478
## [3,] -1.31622036
## [4,] -0.18558547
## [5,]  1.50251840
## [6,] -0.08882951
## [7,] 54.89909606

Result - The square root of the covariance matrix is obtained via its spectral decomposition (V diag(sqrt(lambda)) t(V)). The off-diagonal entries contain both positive and negative values, mirroring the signs of the original covariances
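A quick self-contained check (toy matrix, not the crop data) that the spectral square root squares back to the original matrix, which is what makes it a matrix square root:

```r
# Toy check of the spectral square root: S %*% S recovers the original matrix
A <- matrix(c(5, 2, 2, 3), nrow = 2)
e <- eigen(A)
S <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
all.equal(S %*% S, A)   # TRUE (requires all eigenvalues >= 0)
```

This construction only works when every eigenvalue is non-negative, which is guaranteed for covariance matrices since they are positive semi-definite.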

# Proportion of variance explained. The principal components below are built
# from the correlation matrix (on scaled data), so its eigenvalues are divided
# by their sum, which for a correlation matrix equals the number of variables (7)
Percentage_Variance <- EigenValuescorrelation$values / sum(EigenValuescorrelation$values)
round(Percentage_Variance, 4)
## [1] 0.2759 0.1848 0.1538 0.1461 0.1151 0.0967 0.0276
round(cumsum(Percentage_Variance), 4)
## [1] 0.2759 0.4607 0.6145 0.7606 0.8758 0.9724 1.0000
plot(Percentage_Variance)

plot(cumsum((Percentage_Variance)))

From the above graphs, the curve flattens after the fourth component: the first four eigenvalues of the correlation matrix are each greater than 1 (the Kaiser criterion) and together account for about 76% of the total variance.

So, the first four columns are taken as the principal components
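As a cross-check on the manual decomposition, base R's prcomp() performs the same correlation-matrix PCA in one call (this assumes Data_A, the numeric data frame used for cor() above, is still in scope; component signs may be flipped relative to the manual eigenvectors, which is harmless):

```r
# Cross-check with base R's prcomp; scale. = TRUE reproduces PCA on the
# correlation matrix of Data_A
pca <- prcomp(Data_A, scale. = TRUE)
summary(pca)          # proportion / cumulative variance per component
head(pca$x[, 1:4])    # scores of the first four components
```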

eigenvectors2 = EigenValuescorrelation$vectors[,1:4]
eigenvectors2
##             [,1]        [,2]       [,3]        [,4]
## [1,]  0.30219096 -0.33410693 -0.1120450  0.54165059
## [2,] -0.64378667 -0.03435809 -0.1099391  0.04629318
## [3,] -0.62260719 -0.28382920 -0.1631733  0.15486709
## [4,]  0.21242839 -0.35948683 -0.2482280 -0.69082649
## [5,]  0.06848339 -0.73791663 -0.2135991  0.06717140
## [6,]  0.22694272  0.22065738 -0.5485203  0.39570047
## [7,]  0.07253163 -0.29015800  0.7352670  0.20531846


colnames(eigenvectors2) = c("e1", "e2", "e3","e4")

PC1 <- as.matrix(scaled_data) %*% eigenvectors2[,1]
#PC1
PC2 <- as.matrix(scaled_data) %*% eigenvectors2[,2]
#PC2
PC3 <- as.matrix(scaled_data) %*% eigenvectors2[,3]
#PC3
PC4 <- as.matrix(scaled_data) %*% eigenvectors2[,4]
#PC4
PC <- data.frame(PC1, PC2, PC3,PC4)
head(PC)
##         PC1        PC2      PC3       PC4
## 1 0.5827370 -0.8443937 1.373031 1.6137623
## 2 0.4745270 -0.7847161 1.251893 1.7923547
## 3 0.6339243 -0.6943646 1.179064 1.8176923
## 4 1.0476815 -1.0874105 1.393035 0.9821774
## 5 0.8730591 -0.6585232 1.455354 2.3344808
## 6 0.8470899 -0.9348971 1.576211 1.4739631

Result - head(PC) shows the scores of the four retained principal components for the first six observations. For these rows the PC1, PC3, and PC4 scores are all positive while PC2 is negative; the full score matrix contains both signs.
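The retained scores are easiest to judge visually. A minimal sketch using the PC data frame built above (tidyverse is already loaded): well-separated clusters in this plane would suggest crops that PC1 and PC2 can distinguish.

```r
# Scatter of the first two principal-component scores
ggplot(PC, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.4) +
  labs(title = "Crop observations in PC1-PC2 space",
       x = "PC1 (first principal component)",
       y = "PC2 (second principal component)")
```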

Observations

Learnings - This dataset helped me learn which climatic and soil conditions to check for a particular crop to grow, and that certain crops grow only under a narrow range of climatic conditions. I also learned that visually presenting data requires being clear about what needs to be communicated, and the matrix operations clarified the difference between covariance and correlation.

Limitations - This dataset is very useful for farmers and agriculture students, but the data is limited and the relationships between the variables are not clear; the conditions look random rather than dependent.

Future Work - After analyzing the data, I concluded that it has more scope: the analysis could be extended to indicate the region or country in which each crop will grow, and more data is required to establish the relationships between variables.

Conclusion - This project helps in finding the amount of rainfall and temperature a crop needs to grow, together with soil conditions such as nitrogen content and pH. Although only a few crop varieties are included, with more thorough research this dataset has further scope.

References

The reference below helped me understand why this data was collected and how comparisons are made across different categories.

https://www.kaggle.com/code/atharvaingle/what-crop-to-grow

pH value reference for the one-sample test -

https://nutrientstewardship.org/implementation/soil-ph-and-the-availability-of-plant-nutrients/#:~:text=It%20has%20been%20determined%20that,compatible%20to%20plant%20root%20growth.

t-distribution table -

https://www.google.com/search?q=T+table&tbm=isch&source=iu&ictx=1&vet=1&biw=1387&bih=693&dpr=1#imgrc=Ak3E8SGWtJZSvM